
Improve and re-release chapter 2 #911


Open · burtenshaw wants to merge 8 commits into main

Conversation

burtenshaw (Collaborator)

This is a minor improvement to chapter 2 that does the following:

  • remove tensorflow examples and videos
  • add a page on TGI, vLLM, and llama.cpp

@burtenshaw burtenshaw requested a review from sergiopaniego May 12, 2025 08:54
# Optimized Inference Deployment
sergiopaniego (Member)

There is an issue in the building process. I think it comes from the framework tags in this file. I think at least `<FrameworkSwitchCourse {fw} />` is missing.


The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but as you saw in [Chapter 1](/course/chapter1), this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained.
You'll notice that the tokenizer has added special tokens — `[CLS]` and `[SEP]` — required by the model. Not all models need special tokens; they're utilized when a model was pretrained with them, in which case the tokenizer needs to add them as that model expects these tokens.
sergiopaniego (Member)

This sentence feels convoluted
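
For readers following along, here is a minimal sketch of the special-token behaviour that passage describes; `bert-base-uncased` is only an assumed example checkpoint, and any model pretrained with `[CLS]`/`[SEP]` behaves the same way:

```python
from transformers import AutoTokenizer

# "bert-base-uncased" is only an example; any checkpoint pretrained with
# [CLS]/[SEP] shows the same behaviour.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer("Using a Transformer network is simple")["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens[0], tokens[-1])  # [CLS] [SEP], added automatically by the tokenizer
```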

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

<Tip title="How Flash Attention Works">
Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
sergiopaniego (Member)

Suggested change
- Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
+ Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
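
For anyone who wants to try the technique locally, a minimal sketch of opting into Flash Attention in 🤗 Transformers follows; it assumes a supported GPU with the flash-attn package installed, and the checkpoint name is only an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # example checkpoint only

# attn_implementation="flash_attention_2" requires the flash-attn package
# and a GPU that supports it; otherwise drop the argument.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Flash Attention reduces memory traffic by", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```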


**vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

<Tip title="How Paged Attention Works">
sergiopaniego (Member)

Maintaining consistency in naming:

Suggested change
- <Tip title="How Paged Attention Works">
+ <Tip title="How PagedAttention Works">
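
As a companion to that section, here is a minimal sketch of running a model with vLLM, which uses PagedAttention under the hood; the checkpoint is only an example and this assumes vLLM is installed on a machine with a GPU:

```python
from vllm import LLM, SamplingParams

# PagedAttention is enabled by default in vLLM; the checkpoint below is
# only an example.
llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["PagedAttention manages the KV cache by"], params)
print(outputs[0].outputs[0].text)
```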

@sergiopaniego (Member)

Super well written and informative, as always @burtenshaw. Great improvement over the previous iteration! Main issue is the failing build process. Rest are just nits 😄

@burtenshaw (Collaborator, Author)

> Super well written and informative, as always @burtenshaw. Great improvement over the previous iteration! Main issue is the failing build process. Rest are just nits 😄

Thanks @sergiopaniego. Working on the framework options now.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
